Efficient protein structure archiving using ProteStAr

Abstract

Motivation: The introduction of DeepMind's AlphaFold2 enabled the prediction of protein structures at an unprecedented scale. The AlphaFold Protein Structure Database and the ESM Metagenomic Atlas contain hundreds of millions of structures stored in CIF and/or PDB formats. When compressed with a general-purpose utility like gzip, this translates to tens of terabytes of data, which hinders the effective use of predicted structures in large-scale analyses.

Results: Here, we present ProteStAr, a compressor dedicated to CIF/PDB files, as well as supplementary PAE files. Its main contribution is a novel approach to predicting atom coordinates on the basis of the previously analyzed atoms. This allows efficient encoding of the coordinates, the largest component of the protein structure files. The compression is lossless by default, although a lossy mode with a controlled maximum error of coordinate reconstruction is also available. Compared to the competing packages, i.e. BinaryCIF, Foldcomp, and PDC, our approach offers a superior compression ratio at established reconstruction accuracy. By the efficient use of threads at both compression and decompression stages, the algorithm takes advantage of the multicore architecture of current central processing units and operates with speeds of about 1 GB/s. The Python and C++ APIs further increase the usability of the presented method.

Availability and implementation: The source code of ProteStAr is available at https://github.com/refresh-bio/protestar.


Introduction
For over 50 years, structural biologists have been collecting atomic-level models of proteins in the Protein Data Bank (PDB) (Berman et al. 2000). Nowadays it contains more than 200 thousand experimentally verified protein structures and is an invaluable resource for researchers. Still, this number is orders of magnitude smaller than the number of known proteins. The recent publication of deep learning methods like AlphaFold2 (Jumper et al. 2021) and ESMFold (Lin et al. 2023) was, therefore, a game changer. These tools can produce high-quality predictions of the 3D structure of a protein in minutes on a workstation. This allows individual researchers to fold the proteins they work on by themselves.
In large studies, however, involving thousands or even millions of structures, the prediction costs can still be prohibitive. Therefore, the aforementioned utilities have recently been used to construct large databases like the AlphaFold Protein Structure Database (APSD) (Varadi et al. 2021) with about 214M predictions or ESM Atlas (Lin et al. 2023) with 772M entries. This translates to, respectively, 23 and 17 TB of gzip archives (this is how the databases are distributed) and a few times larger volumes of uncompressed data. Storing a local copy, or even downloading one, is challenging, time-consuming, and costly.
The data in ESM Atlas are provided in textual PDB format (Westbrook et al. 2022). A key component of PDB files and the major contributor to their overall size is a table of Cartesian atom coordinates. The data in APSD are provided in PDB and mmCIF (Westbrook et al. 2022) formats. The latter is also textual (with a similar atom table) but can contain additional fields. An important feature of AlphaFold2 is that it also produces Predicted Aligned Error (PAE) files storing the accuracy of the relative positions of each pair of atoms. The PAE files are given in JSON format and, when gzipped, contribute to the APSD database size about as much as the PDB files.
In recent years, several attempts have been made to develop a compact representation of PDB and mmCIF files that would outperform general-purpose compressors like gzip. An interesting approach is BinaryCIF (Sehnal et al. 2020). The most important idea behind it is to handle mmCIF tables in a column-wise manner and store numeric fields delta-compressed (i.e. store a difference between the current value and the value in the previous row). These files can be further gzip-compressed to achieve an even better compression ratio. In the recent Foldcomp (Kim et al. 2023) compressor for PDB files, the coordinates of amino acid atoms are stored in records of a fixed size. The algorithm stores bond and torsion angles as well as side-chain angles and uses them during the decompression to reproduce atom coordinates. This is space efficient, but the method is inherently lossy. The backbone and side-chain atom positions are reproduced with an average error of ~80 and ~140 mÅ, respectively, though the maximum error may still exceed 1 Å. By redefining the distance between atoms whose positions are stored exactly (25 by default), the user may reduce the reconstruction error at the cost of the compression ratio.
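The delta-coding idea mentioned above can be sketched in a few lines. This is only a minimal illustration of storing differences between consecutive rows of one numeric column; it is not BinaryCIF's actual encoder, and the function names and the sample column are hypothetical.

```python
def delta_encode(values):
    """Keep the first value, then store each row as a difference from the previous row."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Hypothetical x-coordinate column in thousandths of an angstrom
# (integers avoid floating-point rounding issues).
x_column = [12345, 12501, 12398, 12610]
encoded = delta_encode(x_column)
decoded = delta_decode(encoded)
```

The differences are typically much smaller in magnitude than the raw values, which is what makes them cheaper for a subsequent general-purpose or entropy coder.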
The most recent method, also designed for PDB format, is PDC (Zhang and Pyle 2023). In the lossless mode, it stores the coordinates differentially, similarly to BinaryCIF. An interesting idea used here is to reorganize atoms within a residue to minimize the differences between successively encoded entries. In the lossy mode, PDC stores approximate coordinate differences between neighboring Cα atoms as well as torsion and side chain angles. Kim et al. (2023) evaluated several other approaches to compress protein structures including PULCHRA (Rotkiewicz and Skolnick 2008), MMTF (Bradley et al. 2017), and PIC (Staniscia and Yu 2023). Nevertheless, BinaryCIF, Foldcomp, and PDC can be considered the current state of the art. Importantly for huge datasets, only Foldcomp allows storing many files in a single archive and provides fast random access to selected structures. This significantly reduces the maintenance cost as it avoids keeping and navigating hundreds of millions of separate files on the disk.
In this article, we introduce ProteStAr, a novel compression approach for PDB, mmCIF, and PAE files. It allows storing many files in a single archive and decompressing them on demand. For this purpose, we developed two compression components. The first one handles PDB and mmCIF files. A key idea here is to predict the coordinates of successive atoms and store only the errors of the predictions. As the throughput of the algorithm was of crucial importance, the prediction technique is simple and particularly suited for compression, and thus should not be treated as anything comparable to AlphaFold or ESMFold. The second ProteStAr component is PAE compression. Since these files contain square matrices of size equal to the number of atoms in the structure, our algorithm uses some ideas from lossless image compressors.

General overview of the archive format
The ProteStAr archive allows storing many files of the following protein-structure-related formats:
• protein structures in PDB and mmCIF formats,
• predicted aligned errors in PAE format (produced mainly by AlphaFold),
• confidence files in JSON format containing accuracy of predictions of residues (produced mainly by AlphaFold).
The format can be extended in the future if new file types need to be archived. The tool is written in the C++17 programming language and is provided as a command-line utility allowing compression, decompression of all/selected items, and listing. Moreover, we provide a decompression library for Python and C++.
Every file is compressed separately. This allows rapid access to any file in the archive and easy extension of the existing archives by new entries at the cost of a slightly worse compression ratio than if all files were compressed jointly.
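To illustrate why per-file compression gives cheap random access, here is a toy sketch. It is not the ProteStAr archive format: the layout (one byte blob plus a name-to-offset index), the function names, and the use of zlib from the Python standard library are all assumptions made for the example.

```python
import zlib


def build_archive(files):
    """Compress every file independently; return the packed blob and an index."""
    blob = bytearray()
    index = {}
    for name, data in files.items():
        packed = zlib.compress(data)
        index[name] = (len(blob), len(packed))  # (offset, compressed length)
        blob += packed
    return bytes(blob), index


def extract(blob, index, name):
    """Decompress a single entry without touching the others."""
    offset, length = index[name]
    return zlib.decompress(blob[offset:offset + length])


# Hypothetical inputs; real archives would hold PDB/mmCIF/PAE files.
files = {
    "a.cif": b"data_A\n" * 50,
    "b.pdb": b"ATOM      1  N   MET A   1\n" * 50,
}
blob, index = build_archive(files)
restored = extract(blob, index, "b.pdb")
```

Appending a new entry only requires compressing that entry and adding one index record, which mirrors the trade-off described above: easy extension and fast access, but no sharing of statistics across files.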

Compression of PDB and mmCIF files
Since PDB and mmCIF formats are closely related and store similar data, we designed a unified compressor for them, differing mainly by a parser component.
Each file is split into sections, which can be a block or a table. We gather all block sections and then compress them using a general-purpose ZSTD compressor. Processing of tables is more complicated. First, we determine a table type, which could be one of the following: generic, atom, hetatm. A table is classified as atom if it contains one or more chains of amino acids with typical order of atoms and without missing fields. Similarly, a table is classified as hetatm if it contains information on HETATMs without missing data. The remaining tables are classified as generic, which allows handling various types of input files.
A generic table is processed in a column-wise manner. For a numeric column (determined by the contents), we check if it is of some special kind, i.e. all values are equal or the differences between neighboring rows are all the same. If so, the minimum information needed to reconstruct the column is stored. Otherwise, we calculate the differences between the neighboring values (delta-coding) and store them using an entropy coder [in particular, a range coder (Schindler 1998)]. If the column contains coordinates (but the table is generic for any reason) and the user selected lossy compression mode (explained in the following subsections), we reduce the resolution of the values and encode the differences between neighboring numbers using an entropy coder. The remaining columns are concatenated and ZSTD-compressed.
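The column classification described above might look roughly as follows. This is a simplified sketch under stated assumptions: columns are lists of integers, the tags and function names are hypothetical, and the entropy coder that would consume the differences is not shown.

```python
def encode_numeric_column(values):
    """Describe an integer column as constant, constant-step, or delta-coded."""
    if not values:
        return ("empty",)
    if all(v == values[0] for v in values):
        return ("constant", values[0], len(values))
    diffs = [b - a for a, b in zip(values, values[1:])]
    if all(d == diffs[0] for d in diffs):
        return ("constant_step", values[0], diffs[0], len(values))
    # General case: the differences would be passed on to an entropy coder.
    return ("delta", values[0], diffs)


def decode_numeric_column(desc):
    """Rebuild the column from the description produced above."""
    kind = desc[0]
    if kind == "empty":
        return []
    if kind == "constant":
        return [desc[1]] * desc[2]
    if kind == "constant_step":
        return [desc[1] + i * desc[2] for i in range(desc[3])]
    out = [desc[1]]
    for d in desc[2]:
        out.append(out[-1] + d)
    return out
```

For example, a column of consecutive atom serial numbers collapses to a start value, a step, and a length, regardless of how many rows the table has.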
An atom table is usually a major part of a PDB/mmCIF file, so compressing it efficiently is crucial. Since the order of atoms in every amino acid is established and we know the sequence of amino acids, we do not need to store atom types. The compression of Cartesian coordinates is complex and will be explained in the next subsection. Our tool lets the user decide whether the compression should be lossless (default) or lossy. In the latter case, the user can specify the maximum error of position of the backbone as well as side-chain atoms. If provided, we round the coordinates accordingly (as described in the next subsection) and handle them losslessly. This is an important advantage over Foldcomp and PDC. They also support lossy compression of coordinates, but the user cannot specify the error bounds, and the individual errors could significantly exceed the average.
In the case of AlphaFold2, predictions of the B-factor field are usually constant for a residue. If this is the case, we store these values compactly. Otherwise, we store separate B-factors for every atom, but the user can provide a flag telling the tool that B-factors should be averaged over a residue. B-factors are delta-coded and stored using an entropy coder. The remaining columns in atom tables are handled similarly to those in generic tables.
Currently, hetatm tables are handled in the same way as generic tables, but we plan to implement more sophisticated variants using some ideas from atom sections.

Compression of atom coordinates
The existing methods, like Foldcomp and PDC, make use of the classic description of how the protein structure is folded. They store angles, which can be used to reconstruct the atom coordinates. For the sake of compactness, the angles are saved with a limited precision, making the reconstruction also imprecise. Moreover, the errors can accumulate, which is the main reason why the algorithms are unable to control the maximum error.
Our approach works differently and is free of this disadvantage. We predict each atom's coordinates using the three closest, already processed atoms (reference atoms). The candidates are backbone atoms from the previous residue as well as the atoms that have already been stored in the current residue (this is why some fixed ordering of atoms in residues is required).
The prediction is done with the use of pre-trained models that were incorporated into the tool. As the training data, we used the human proteome from APSD v4 comprising about 186k mmCIF files. The package contains the learning module, so if necessary, the user can repeat this stage on their own mmCIF/PDB files and recompile the tool. Nevertheless, changing the model makes the archives incompatible, so the module is provided mostly for research purposes.
In the first step of training, for each atom in each residue type, we calculate the distances to the atoms that could serve as references. Then, for each atom type and each residue type, we pick three reference atom types that are on average the closest.
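The first training step can be sketched as an average-distance ranking. The code below is an illustration only: the data layout (one dictionary of atom name to coordinates per residue occurrence), the function name, and the toy atom names and coordinates are assumptions, and the real candidate set would also include backbone atoms of the previous residue.

```python
from collections import defaultdict
from math import dist


def pick_references(residues, target, candidates, n_refs=3):
    """Return the n_refs candidate atom names that are closest to `target` on average."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for atoms in residues:
        if target not in atoms:
            continue
        for name in candidates:
            if name in atoms:
                sums[name] += dist(atoms[target], atoms[name])
                counts[name] += 1
    means = {name: sums[name] / counts[name] for name in counts}
    return sorted(means, key=means.get)[:n_refs]


# Toy occurrences of one hypothetical residue type (not real geometry).
toy_residues = [
    {"T": (0.0, 0.0, 0.0), "A": (1.0, 0.0, 0.0), "B": (0.0, 2.0, 0.0),
     "C": (0.0, 0.0, 3.0), "D": (4.0, 0.0, 0.0)},
    {"T": (0.1, 0.0, 0.0), "A": (1.2, 0.0, 0.0), "B": (0.0, 2.1, 0.0),
     "C": (0.0, 0.0, 2.9), "D": (4.3, 0.0, 0.0)},
]
references = pick_references(toy_residues, "T", ["A", "B", "C", "D"])
```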
In the second step of training, for each atom type of each residue type, we collect information about up to $N = 10^6$ appearances of this atom type. Each observation is stored as a 6-tuple $\langle r_{12}, r_{13}, r_{23}, f_1, f_2, f_3 \rangle$ containing distances between the three reference atoms as well as distances between the fourth atom and the references. Each such record can be treated as a description of a tetrahedron. These tetrahedrons are used to predict atom coordinates at the compression stage.
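As a small illustration of the 6-tuple, the sketch below computes the three base distances and the three distances to the fourth atom from Cartesian coordinates. The function name and the sample coordinates are hypothetical.

```python
from math import dist


def six_tuple(ref1, ref2, ref3, atom):
    """Return (r12, r13, r23, f1, f2, f3) for three reference atoms and a fourth atom."""
    return (
        dist(ref1, ref2),  # r12
        dist(ref1, ref3),  # r13
        dist(ref2, ref3),  # r23
        dist(atom, ref1),  # f1
        dist(atom, ref2),  # f2
        dist(atom, ref3),  # f3
    )


# Toy coordinates chosen so that the base is a 3-4-5 right triangle.
observation = six_tuple((0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 12.0))
```

Because only distances are kept, the record is independent of where the residue sits in space and how it is oriented, which is what allows observations from many structures to be clustered together.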
In the third training step, we perform the clustering of 6-tuples using the well-known k-means algorithm (Forgy 1965). We try clustering for $k = 1, \ldots, 20$ centroids and pick the value that minimizes the estimated cost of the coordinate encoding. The process of selection of k mimics (in some sense) what we do in compression, so we will go back to this step after describing how the compression works.

Now, let us assume that we want to encode the coordinates of the current atom CE1 in HIS. The model shows that the current residue's reference atoms are ND1, CD2, and CG. The model contains three centroids for this atom type that define three tetrahedrons. Now we examine all the tetrahedrons and check which of them allows prediction of the current atom position with the lowest error. Conceptually, we construct a tetrahedron for each centroid and then replace its base by changing the distances between the reference atoms to the distances in the current situation, as they are known at both the compression and decompression runs. We keep the distances to the fourth atom as they are in the centroid. Such modified tetrahedrons predict the positions of the fourth atom. We compare these predictions with the position of the current atom and select the best one. An illustration of this process is given in Fig. 1.
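Geometrically, placing the fourth atom given a known base and three distances is a trilateration problem. The sketch below uses the textbook trilateration construction; it is one standard way to realize the idea and is not claimed to be how ProteStAr implements it. The function name, the helper functions, and the sample values are assumptions.

```python
from math import dist, sqrt


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def _scale(a, s):
    return tuple(x * s for x in a)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def predict_fourth_atom(p1, p2, p3, f1, f2, f3):
    """Return both points at distances f1, f2, f3 from p1, p2, p3, or None if none exist."""
    d = sqrt(_dot(_sub(p2, p1), _sub(p2, p1)))
    if d == 0.0:
        return None
    ex = _scale(_sub(p2, p1), 1.0 / d)          # unit vector p1 -> p2
    v = _sub(p3, p1)
    i = _dot(ex, v)
    ey_raw = _sub(v, _scale(ex, i))
    j = sqrt(_dot(ey_raw, ey_raw))
    if j == 0.0:                                 # collinear reference atoms
        return None
    ey = _scale(ey_raw, 1.0 / j)
    ez = _cross(ex, ey)                          # normal of the base plane
    x = (f1 * f1 - f2 * f2 + d * d) / (2.0 * d)
    y = (f1 * f1 - f3 * f3 + i * i + j * j) / (2.0 * j) - (i / j) * x
    z2 = f1 * f1 - x * x - y * y
    if z2 < 0.0:                                 # distances incompatible with this base
        return None
    z = sqrt(z2)
    base = _add(p1, _add(_scale(ex, x), _scale(ey, y)))
    return _add(base, _scale(ez, z)), _add(base, _scale(ez, -z))


# Current reference atoms (toy coordinates) and distances taken from a hypothetical centroid.
refs = ((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0))
centroid_f = (sqrt(3.0), sqrt(3.0), sqrt(3.0))
actual = (1.0, 1.0, 1.05)
candidates = predict_fourth_atom(*refs, *centroid_f)
best_side = min(range(2), key=lambda s: dist(candidates[s], actual))
```

The two returned points are mirror images with respect to the plane of the reference atoms, so a decoder needs one extra bit of information (here `best_side`) to know which one the encoder used.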
Technical details are a bit more complicated. First, the tetrahedrons are not normalized [i.e. the first reference atom is not at (0, 0, 0)]. We use the real coordinates of the reference atoms in the current situation and construct tetrahedrons in real space. Then, we estimate the encoding cost of the prediction error according to formula (1). Thus, it may happen that we will select a worse prediction (i.e. one with a larger Euclidean distance) if it could be encoded more cheaply. After selection of the best centroid, we entropy-encode its identifier, $\Delta x$, $\Delta y$, $\Delta z$, and the identifier of the tetrahedron (for each centroid, we can construct two equivalent tetrahedrons and we check both). If replacing the base makes constructing a tetrahedron with preserved distances to the fourth atom impossible, the positions of the first two reference atoms are returned as predicted positions. Handling of the first three atoms of a chain is special: as there are no reference atoms for them, their coordinates are delta-encoded.
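The following sketch shows why a cost-based choice can differ from a nearest-prediction choice. The cost function used here, a sum of $\log_2(1 + |\Delta|)$ over the three axes, is an assumption made for illustration (it merely mirrors the $\log_2(1 + k)$ form used for the centroid id elsewhere in the text) and should not be read as the paper's formula (1). Function names and the integer coordinates are hypothetical.

```python
from math import dist, log2


def encoding_cost(delta):
    """ASSUMED cost model: bits grow with the magnitude of each coordinate difference."""
    return sum(log2(1 + abs(d)) for d in delta)


def pick_cheapest(actual, predictions):
    """Return (index, cost, delta) of the prediction with the lowest assumed cost."""
    best = None
    for idx, pred in enumerate(predictions):
        delta = tuple(a - p for a, p in zip(actual, pred))
        cost = encoding_cost(delta)
        if best is None or cost < best[1]:
            best = (idx, cost, delta)
    return best


# Toy coordinates in thousandths of an angstrom.
actual = (1000, 2000, 3000)
predictions = [(970, 1970, 2970), (945, 2000, 3000)]
chosen = pick_cheapest(actual, predictions)
euclidean = [dist(actual, p) for p in predictions]
```

In this toy case the second prediction is farther from the true position in Euclidean terms, yet it is selected because two of its three differences are zero and therefore cost nothing under the assumed model.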
In the lossy mode, the user provides the maximum error for backbone atoms and, separately, for side chain atoms. Then, prior to processing, we round the input coordinates of all atoms with the resolution $r = 2m/\sqrt{3}$, where $m$ is the maximum error. Then, we prune the model for each atom type by removing the centroids that are identical after coordinate rounding, which can potentially reduce the cost of encoding the centroid id.
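A quick way to see why $r = 2m/\sqrt{3}$ bounds the error by $m$: if every coordinate is rounded to the nearest multiple of $r$, each axis is off by at most $r/2 = m/\sqrt{3}$, so the Euclidean error is at most $\sqrt{3} \cdot m/\sqrt{3} = m$. The sketch below checks this numerically. Rounding to the nearest multiple is an assumption about the exact rounding rule, and the function name and test points are hypothetical.

```python
import random
from math import dist, sqrt


def round_to_grid(coord, max_error):
    """Round each coordinate to the nearest multiple of r = 2 * max_error / sqrt(3)."""
    step = 2.0 * max_error / sqrt(3.0)
    return tuple(step * round(c / step) for c in coord)


# Empirical check with random points; units are angstroms, bound is 100 mA.
random.seed(0)
max_error = 0.1
worst = 0.0
for _ in range(10000):
    point = tuple(random.uniform(-50.0, 50.0) for _ in range(3))
    worst = max(worst, dist(point, round_to_grid(point, max_error)))
```

After this step the rounded coordinates can be treated as exact input, which is consistent with the statement that lossy data are subsequently handled losslessly.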
We also provide a simplified selection of a centroid for prediction. Instead of examining all of them, we just calculate the 6-tuple for the current atom and look for the closest centroid in the model (Euclidean distance is used here). This is significantly faster but leads to a bit worse compression ratio. The user can select this optional mode during the compression.
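The simplified selection amounts to a nearest-neighbor lookup in the six-dimensional space of the tuples. A minimal sketch, with a hypothetical function name and made-up centroid values:

```python
def closest_centroid(observation, centroids):
    """Return the index of the centroid with the smallest Euclidean distance to the 6-tuple."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)),
               key=lambda c: squared_distance(observation, centroids[c]))


# Toy model with three centroids (r12, r13, r23, f1, f2, f3); values are illustrative only.
centroids = [
    (1.4, 2.4, 2.5, 1.3, 2.2, 2.3),
    (1.4, 2.4, 2.5, 1.5, 2.5, 2.6),
    (1.5, 2.5, 2.4, 1.3, 3.6, 3.7),
]
current = (1.41, 2.39, 2.52, 1.48, 2.47, 2.61)
chosen = closest_centroid(current, centroids)
```

Comparing squared distances avoids a square root per centroid and yields the same ranking, which fits the stated goal of making this mode faster than constructing a tetrahedron for every centroid.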
Now, let us go back to the learning stage. When we decide which value of k (the number of centroids) to use, we try to estimate the encoding cost of the N collected observations using the current model. The cost is estimated using formula (1) increased by the cost of encoding the centroid id: $\log_2(1 + k)$. We pick the value of k that minimizes the estimated cost of encoding. In our model, the value of k varies between 1 (N in HIS) and 20 (CG in GLU).
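The trade-off behind the choice of k can be illustrated with a toy calculation: more centroids usually lower the cost of the prediction errors, but each additional centroid raises the $\log_2(1 + k)$ term for the centroid id. In the sketch below the per-k residual costs are hypothetical placeholder numbers, not measurements, and the function name is an assumption.

```python
from math import log2


def select_k(residual_bits_by_k):
    """Pick k minimizing (mean residual cost for that k) + log2(1 + k)."""
    def total_cost(k):
        return residual_bits_by_k[k] + log2(1 + k)

    return min(sorted(residual_bits_by_k), key=total_cost)


# Hypothetical mean residual costs (bits per atom) after clustering with k centroids.
toy_costs = {1: 14.0, 2: 12.5, 3: 11.9, 4: 11.8}
best_k = select_k(toy_costs)
```

Here the step from three to four centroids saves only 0.1 bit on the residuals while the id term grows by about 0.32 bit, so the smaller model wins.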

Compression of PAE files
Predicted aligned errors produced by AlphaFold2 tell how trustworthy the relative positions of every pair of atoms are. Since a file stores a square matrix containing integer values from the range $[0, 32]$, our approach uses some ideas from lossless image compression methods like PNG (Salomon and Motta 2010), FLIF (Sneyers and Wuille 2016), and JPEG-XL (Alakuijala et al. 2019).
We process the matrix row-wise and column-wise within a row. We predict the value of the current item $v(i, j)$ taking into account the neighboring items. Let us denote the prediction as $e(i, j)$. First, we decide if the current item should be predicted horizontally or vertically. To do this, we check statistics (built on the already processed part of the matrix) on how frequently the difference between an item and its upper neighbor was smaller in the jth column than between an item and its left neighbor in the ith row. We choose the direction that more likely leads to the smaller difference. Without loss of generality, let us assume that $e(i, j)$ is predicted horizontally (the other case is symmetric). If $i = j$, we predict 0 since this is the PAE value for the same atom. If $|i - j| = 1$, we predict 1. Otherwise, we check if $v(i, j-1) - v(i-1, j) > 4$. If so, we predict $e(i, j) = v(i, j-1) - 1$. If not, and $v(i-1, j) - v(i, j-1) > 4$, we predict $e(i, j) = v(i, j-1) + 1$.
Then, we entropy-encode the error of prediction, i.e. $e(i, j) - v(i, j)$, in the selected context. There are 8100 possible contexts, which allows the statistics to be stored in the fast cache memory, making the algorithm fast. Slightly better compression ratios would be possible for large matrices at the cost of a more complicated model and slower processing.
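The horizontal prediction rules can be written down directly. The sketch below covers only the cases spelled out in the text; the final fallback (predicting the left neighbor when neither difference exceeds 4) is an assumption, as are the function name and the toy matrix. Direction selection, context modeling, and matrix borders are not modeled.

```python
def predict_horizontal(v, i, j):
    """Predict v[i][j] from its left and upper neighbors (requires i >= 1 and j >= 1
    whenever |i - j| > 1)."""
    if i == j:
        return 0                      # diagonal entry
    if abs(i - j) == 1:
        return 1                      # immediate off-diagonal entry
    left = v[i][j - 1]
    up = v[i - 1][j]
    if left - up > 4:
        return left - 1
    if up - left > 4:
        return left + 1
    return left                       # ASSUMPTION: case not specified in the text


# Toy PAE-like matrix with values in [0, 32]; not taken from a real prediction.
toy = [
    [0, 1, 9, 14],
    [1, 0, 1, 3],
    [8, 1, 0, 1],
    [15, 9, 1, 0],
]
prediction = predict_horizontal(toy, 1, 3)
residual = prediction - toy[1][3]     # the value that would be entropy-coded
```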
Since the full resolution of PAE files is not necessary for some purposes, we also propose a lossy scheme. The processing is the same, with the only important difference being "rounding" the original values according to the scheme presented in Table 1 and small differences in how the context is determined.
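The rounding scheme can be sketched using the level 4 grouping from Table 1 (0, 1, 2, 3, 4-5, 6-8, 9-12, 13-17, 18-24, 25-32) together with the table's reconstruction rule, under which a value from a group with bounds $\ell$ and $r$ is restored as $\lfloor (\ell + r)/2 \rfloor$. The constant and function names are hypothetical.

```python
# Level 4 groups from Table 1 as inclusive (low, high) pairs.
LEVEL4_GROUPS = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 5), (6, 8),
                 (9, 12), (13, 17), (18, 24), (25, 32)]


def quantize(value, groups):
    """Map a PAE value to the representative of its group: floor((low + high) / 2)."""
    for low, high in groups:
        if low <= value <= high:
            return (low + high) // 2
    raise ValueError(f"PAE value {value} is outside the supported range")


lookup = {value: quantize(value, LEVEL4_GROUPS) for value in range(33)}
distinct_levels = sorted(set(lookup.values()))
```

Small values, where precise PAE information tends to matter most, keep their exact value, while large values are merged into progressively wider groups.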

Compression of other file types
In the current version of the tool, we also allow storing confidence files in JSON format. Such files are small as they contain only 3 arrays of size equal to the number of residues. They contain residue id (integer), confidence score (float value), and confidence category (single character). Thus, we ZSTD-compress them.
The archive format is, however, open, so it is possible to add other file types in the future.

Experimental results
We used two large databases of predicted protein structures in the experiments: APSD v4 and ESM Atlas containing 214M and 772M structures, respectively. The testing machine was equipped with a 64-core AMD 3995WX Pro CPU clocked at 2.7 GHz and 512 GiB RAM running under the openSUSE Tumbleweed operating system. The disks were an NVMe Seagate FireCuda 530 4 TB and a RAID5 composed of four Seagate Exos 16 TB HDDs. The experiments were carried out on the NVMe disk if not stated otherwise. Details on the datasets, tools, and command lines used in this study are given in the Supplementary data.
As a first step, we evaluated the performance of the tools on 542k AlphaFold predictions of SwissProt proteins as well as a subset of 59k predictions of various qualities from the ESM Atlas database. Foldcomp and ProteStAr, as the only multi-threaded packages in the analysis, were run with 16 computing threads. Since gzip and BinaryCIF are one-to-one compressors, the collection of their output files (one per structure) was gathered using the well-known tar utility. This was to avoid space overhead related to storing hundreds of thousands of files on disk. The summary of the experiments is presented in Fig. 2. Detailed results are given in the Supplementary Worksheet.
When considering lossless compression, our algorithm was four times better than gzip and two times better than state-of-the-art BinaryCIF (for mmCIF files) and PDC (for PDB files). Storing only essential parts of PDB files (the header and the table with atom descriptions) in ProteStAr minimal mode further increased the compression ratio. Our tool's compression and decompression speeds were about 500 and 2400 MB/s, respectively, significantly higher than those of the competitors.
For the lossy compression, ProteStAr was compared to PDC and Foldcomp. As these tools store only the header and atom table of PDB files, our tool was configured in the same way. Moreover, ProteStAr, unlike the competitors, allows controlling the maximum error of atom coordinates reconstruction (i.e. the Euclidean distance in Å between corresponding atoms in the original and decompressed files). Therefore, we investigated several lossy variants of our algorithm with error rates comparable to the competing tools. In particular, the series ProteStAr 80/140, with error bounds of 80 and 140 mÅ for backbone and side chain atoms, respectively, was selected as a direct competitor of Foldcomp. Analogously, ProteStAr 200/300 was compared with PDC. As the results on AlphaFold SwissProt show, both these variants outperformed state-of-the-art methods in terms of compression ratio. Interestingly, even the ProteStAr 10/100 mode reduced the input data size by a factor of 43, which is roughly ten times better than gzip. This shows how much space can be saved when one uses a specialized tool and accepts a reduced precision of atom coordinates. Unfortunately, we were not able to evaluate any of the competitors on the 59k subset from ESM Atlas. PDC failed to process all the files, while Foldcomp produced invalid archives for files inconsistent with the PDB format specification (e.g. missing atoms, errors in the labeling of residues), which are present in ESM Atlas. However, the results of ProteStAr were similar to those observed for AlphaFold SwissProt predictions.

Table 1
Lossy level   Grouping of PAE values (a)
1             0, 1, …, 19, 20-21, 22-23, 24-27, 28-32
2             0, 1, …, 9, 10-12, 13-15, 16-19, 20-23, 24-27, 28-32
3             0, 1, …, 5, 6-7, 8-10, 11-14, 15-19, 20-25, 26-32
4             0, 1, 2, 3, 4-5, 6-8, 9-12, 13-17, 18-24, 25-32
(a) Every value from the $[\ell, r]$ range is reconstructed during decompression as $\lfloor (\ell + r)/2 \rfloor$.
In the next experiment, we evaluated the error of atom coordinates reconstruction of Foldcomp, PDC, and selected lossy modes of ProteStAr. The analysis relied on compressing and decompressing all AlphaFold predictions of mouse proteome (21 615 files) and comparing the reconstructed atom positions with the input ones.
The resulting per-atom error histograms for backbones and side chains as well as compression ratios are presented in Fig. 3a. The analogous histograms of per-structure root mean squared deviations (RMSD) and mean absolute errors (MAE) are shown in Supplementary Figs. 1 and 2. As one can observe, the reconstruction accuracy of ProteStAr was always within the requested bounds, while the errors of PDC and Foldcomp were sometimes significantly larger than the average. In Supplementary Section 5, we listed fragments of original and reconstructed PDB files containing atoms with the largest reconstruction error (five for each of the investigated packages).
As a next step, we measured the impact of the maximum allowed reconstruction error on the ProteStAr compression ratio. The results for the APSD E. coli dataset, containing 10.5 million atoms (from 2500 randomly selected files), are presented in Fig. 3b. As one can see, the file size rapidly dropped from 38.1 MB for lossless compression to 15.9 MB for maximum errors of 100 mÅ, and then the improvement became moderate. This can be partially explained by the fact that when the coordinates are rounded, the accuracy of the prediction of the fourth atom position decreases. Nevertheless, for the 500 mÅ error bound, the archive size was 13.4 MB with approximately 10 MB used for atom coordinates. This translates to <8 bits for storing 3D coordinates of a single atom.
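The bits-per-atom figure follows from a one-line calculation, reproduced here as a check. The only assumption is that 1 MB is taken as $10^6$ bytes; the variable names are illustrative.

```python
coordinate_bytes = 10e6   # approximately 10 MB used for atom coordinates
atom_count = 10.5e6       # 10.5 million atoms in the dataset

# Bits needed on average for the x, y, z coordinates of one atom.
bits_per_atom = coordinate_bytes * 8 / atom_count
```

The result is about 7.6 bits per atom, i.e. roughly 2.5 bits per coordinate at the 500 mÅ error bound.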
Table 2 shows the sizes of the archives produced by the examined tools. We used five proteomes from the AlphaFold database, SwissProt proteins, a subset of ESM Atlas, and the full ESM Atlas (version v0 and the supplement). The largest dataset was used to show the scalability of the examined tools. ProteStAr, depending on the mode, needed from 29 to 31 h to complete for ESM Atlas v0. Importantly, the ProteStAr compression ratios were almost independent of the dataset, in spite of training internal models on human AlphaFold predictions. All the investigated packages are memory efficient: the maximal observed memory usage in the experiments was 7 MB for gzip, 12 MB for PDC, 170 MB for Foldcomp, and 311 MB for ProteStAr.
In the final experiment, we evaluated the compression of PAE files. We are not aware of any specialized compressor for this format, so we compared ProteStAr with gzip, which was used for storing PAE in the AlphaFold database. Since PAE files are useless if not accompanied by PDB/mmCIF structures, we used the entire APSD human proteome composed of 186 016 triples of files: (i) mmCIF describing a structure, (ii) PAE showing the accuracy of the relative position of every pair of atoms, (iii) confidence of the prediction of each residue. The results are presented in Fig. 4. As one can observe, in the lossless mode, ProteStAr compressed PAE files two times better than gzip, while in the lossy schemes, the advantage over gzip was much larger. When we focus on the size of the whole archive, we can see that the initial dataset of size 119.2 GB could be packed in the lossless mode about three times better than gzip, to just 7.5 GB. This drops to 2.5 GB if we choose the most aggressive scheme in which atom coordinates are stored with 200/300 mÅ maximum error, and the PAE value resolution is reduced from 33 to 11 distinct values. The compression and decompression speeds of PAE files ranged from 800 MB/s (lossless mode) to 950 MB/s (lossy modes). When we consider all file types together, these values were 600-790 MB/s in compression and 1090-1270 MB/s in decompression. The memory footprint of ProteStAr was about 2 GB in compression and 1 GB in decompression.
The same APSD human proteome dataset (558 054 files in total, 7.5 GB in ProteStAr lossless or 21.0 GB in tar) was used to check the random access speed of ProteStAr. The results highly depended on where the archive resided (NVMe, HDD), how large the file was compared to the RAM of the machine, and whether the file was cached by the OS (due to previous queries) or not. To evaluate this, we performed the experiments in three scenarios: (i) file on NVMe and cached by the OS, (ii) file on NVMe, not cached, (iii) file on HDD, not cached. For ProteStAr, the time to list the archive contents was less than a second in all examined cases. For the tarred archive, the times were, respectively, 2.8, 2.9, and 14.4 s. Extraction of a single file by ProteStAr took from 0.02 to 0.25 s depending on the file size, and these times were almost the same in the examined scenarios. The times for tar were 1.0, 13.2, and 31.6 s, respectively. Clearly, using tar was significantly more time-consuming, especially when data were not cached. Alternatively, one can store gzips as they are (without tar), allowing fast random access to the data at the cost of dealing with hundreds of millions of files, which can be problematic. As Foldcomp does not support PAE and confidence files, we were not able to include it in this comparison. Nevertheless, for databases with PDB/mmCIF structures, its extraction of a single file was about as fast as for ProteStAr.
As a final experiment, we measured the performance of the Python and C++ libraries using the same APSD human proteome dataset. Times of opening, listing, and extracting any file were significantly less than a second, which shows that the compressed archives can be safely used in real pipelines without worries about extra time for accessing the data.

Conclusions
With rapidly increasing sizes of databases of protein structure predictions, the general-purpose gzip will soon become insufficient in terms of both the compression ratio and the processing speed. Our tool, ProteStAr, is able to compress mmCIF and PDB files losslessly four times better than gzip and two times better than the state-of-the-art BinaryCIF and PDC algorithms. If this reduction is not enough, one can use lossy compression with a controlled error of position reconstruction. This feature, not provided by any of the existing algorithms, allows compacting the data approximately ten times better than gzip, while maintaining the maximum reconstruction error at very good levels: 10 mÅ for backbone and 100 mÅ for side chain atoms. All this is obtained at compression/decompression rates comparable to disk throughput (~1 GB per second) with very fast random access to the archive entries. The functionality of the software is complemented by the support of nonstructure files distributed in the prediction databases (PAE, confidence).
Importantly, ProteStAr, unlike the competitors, is robust to inconsistencies with the mmCIF/PDB format specification and was able to process the entire ESM Atlas database.

The largest dataset is split into the initial (v0) release and the supplement (v2023_02).

Figure 1. Illustration of how the prediction of the current atom works. First, we take from the model the centroids describing tetrahedrons. Second, we calculate a tetrahedron for the current situation. Third, we use the distances of the current atom to the reference atoms to adjust tetrahedrons from the model. Fourth, we calculate the distances between the predictions of the position of the current atom given by the model and the current one to find the best prediction.

Figure 2. Comparison of CIF/PDB compressors for subsets of APSD and ESM Atlas databases. Compression ratios are calculated as input_size/compressed_size. Series ProteStAr x/y represent a lossy variant of our algorithm with a maximum error of x mÅ for backbone atoms and y mÅ for side chain atoms.

Figure 3. Evaluation of atom coordinates reconstruction error. (a) Histograms of per-atom reconstruction errors for backbone (red) and side chain (blue) atoms. Median, average, and maximum errors are also presented. Green bars show compression ratio (input_size/compressed_size). (b) Impact of the maximum allowed reconstruction error (same for backbone and side chain) on the overall archive size (orange) and size of the coordinates alone (purple).

Table 1. Grouping of PAE values for supported lossy levels (a).